{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h1> Text Classification using TensorFlow on Cloud ML Engine </h1>\n",
    "\n",
    "This notebook illustrates:\n",
    "<ol>\n",
    "<li> Creating datasets for Machine Learning using BigQuery\n",
    "<li> Creating a text classification model using the high-level Estimator API \n",
    "<li> Training on Cloud ML Engine\n",
    "<li> Deploying model\n",
    "<li> Predicting with model\n",
    "</ol>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true,
    "jupyter": {
     "outputs_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "# change these to try this notebook out\n",
    "BUCKET = 'cloud-training-demos-ml'\n",
    "PROJECT = 'cloud-training-demos'\n",
    "REGION = 'us-central1'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true,
    "jupyter": {
     "outputs_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "import os\n",
    "os.environ['BUCKET'] = BUCKET\n",
    "os.environ['PROJECT'] = PROJECT\n",
    "os.environ['REGION'] = REGION"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true,
    "jupyter": {
     "outputs_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "%datalab project set -p $PROJECT"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "!pip install --upgrade tensorflow"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1.2.1\n"
     ]
    }
   ],
   "source": [
    "import tensorflow as tf\n",
    "print tf.__version__"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The idea is to look at the title of a newspaper article and figure out whether the article came from the New York Times or from TechCrunch. There are very sophisticated approaches that we can try, but for now, let's go with something very simple.\n",
    "\n",
    "<h2> Data exploration and preprocessing in BigQuery </h2>\n",
    "<p>\n",
    "What does the Hacker News dataset look like?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "\n",
       "    <div class=\"bqtv\" id=\"1_149790510276\"><table><tr><th>url</th><th>title</th><th>score</th></tr><tr><td></td><td>Ask HN: What are some good gift ideas for hacker types?</td><td>44</td></tr><tr><td></td><td>Ask HN: After Google Apps and Outlook, which email provider for custom domain?</td><td>22</td></tr><tr><td>http://code.ipstenu.org/2011/the-legality-of-forking/</td><td>The Legality of Forking</td><td>17</td></tr><tr><td>https://medium.com/@polarrist/where-are-chernobyl-s-children-a-photojournalist-s-honest-project-in-the-age-of-disaster-tourism-4cd333ab80c7</td><td>Where Are Chernobyl’s Children?</td><td>50</td></tr><tr><td>http://www.bbc.co.uk/news/technology-19597437</td><td>Twitter hands over messages at heart of Occupy case</td><td>61</td></tr><tr><td>http://weblogs.asp.net/fbouma/archive/2013/08/13/windows-store-account-getting-rid-of-it-is-as-hard-as-signing-up.aspx</td><td>Windows Store dev account: getting rid of it is as hard as signing up </td><td>51</td></tr><tr><td>http://www.forbes.com/sites/andyellwood/2012/01/18/being-a-regular/</td><td>Being A Regular</td><td>28</td></tr><tr><td></td><td>Ask HN: The Road to Becoming an Angel or VC</td><td>12</td></tr><tr><td>http://personalmba.com/best-business-books/</td><td>Best Business Books</td><td>20</td></tr><tr><td>http://www.couch.io/migrating-to-couchdb</td><td>Migrating to CouchDB</td><td>48</td></tr></table></div>\n",
       "    <br />(rows: 10, time: 2.7s,   229MB processed, job: job_xEjVjj18rbx51BN1gSFipUuGdSE)<br />\n",
       "    <script>\n",
       "\n",
       "      require.config({\n",
       "        paths: {\n",
       "          d3: '//cdnjs.cloudflare.com/ajax/libs/d3/3.4.13/d3',\n",
       "          plotly: 'https://cdn.plot.ly/plotly-1.5.1.min.js?noext',\n",
       "          jquery: '//ajax.googleapis.com/ajax/libs/jquery/2.0.0/jquery.min'\n",
       "        },\n",
       "        map: {\n",
       "          '*': {\n",
       "            datalab: 'nbextensions/gcpdatalab'\n",
       "          }\n",
       "        },\n",
       "        shim: {\n",
       "          plotly: {\n",
       "            deps: ['d3', 'jquery'],\n",
       "            exports: 'plotly'\n",
       "          }\n",
       "        }\n",
       "      });\n",
       "\n",
       "      require(['datalab/charting', 'datalab/element!1_149790510276', 'base/js/events',\n",
       "          'datalab/style!/nbextensions/gcpdatalab/charting.css'],\n",
       "        function(charts, dom, events) {\n",
       "          charts.render('gcharts', dom, events, 'table', [], {\"rows\": [{\"c\": [{\"v\": \"\"}, {\"v\": \"Ask HN: What are some good gift ideas for hacker types?\"}, {\"v\": 44}]}, {\"c\": [{\"v\": \"\"}, {\"v\": \"Ask HN: After Google Apps and Outlook, which email provider for custom domain?\"}, {\"v\": 22}]}, {\"c\": [{\"v\": \"http://code.ipstenu.org/2011/the-legality-of-forking/\"}, {\"v\": \"The Legality of Forking\"}, {\"v\": 17}]}, {\"c\": [{\"v\": \"https://medium.com/@polarrist/where-are-chernobyl-s-children-a-photojournalist-s-honest-project-in-the-age-of-disaster-tourism-4cd333ab80c7\"}, {\"v\": \"Where Are Chernobyl\\u2019s Children?\"}, {\"v\": 50}]}, {\"c\": [{\"v\": \"http://www.bbc.co.uk/news/technology-19597437\"}, {\"v\": \"Twitter hands over messages at heart of Occupy case\"}, {\"v\": 61}]}, {\"c\": [{\"v\": \"http://weblogs.asp.net/fbouma/archive/2013/08/13/windows-store-account-getting-rid-of-it-is-as-hard-as-signing-up.aspx\"}, {\"v\": \"Windows Store dev account: getting rid of it is as hard as signing up \"}, {\"v\": 51}]}, {\"c\": [{\"v\": \"http://www.forbes.com/sites/andyellwood/2012/01/18/being-a-regular/\"}, {\"v\": \"Being A Regular\"}, {\"v\": 28}]}, {\"c\": [{\"v\": \"\"}, {\"v\": \"Ask HN: The Road to Becoming an Angel or VC\"}, {\"v\": 12}]}, {\"c\": [{\"v\": \"http://personalmba.com/best-business-books/\"}, {\"v\": \"Best Business Books\"}, {\"v\": 20}]}, {\"c\": [{\"v\": \"http://www.couch.io/migrating-to-couchdb\"}, {\"v\": \"Migrating to CouchDB\"}, {\"v\": 48}]}], \"cols\": [{\"type\": \"string\", \"id\": \"url\", \"label\": \"url\"}, {\"type\": \"string\", \"id\": \"title\", \"label\": \"title\"}, {\"type\": \"number\", \"id\": \"score\", \"label\": \"score\"}]},\n",
       "            {\n",
       "              pageSize: 25,\n",
       "              cssClassNames:  {\n",
       "                tableRow: 'gchart-table-row',\n",
       "                headerRow: 'gchart-table-headerrow',\n",
       "                oddTableRow: 'gchart-table-oddrow',\n",
       "                selectedTableRow: 'gchart-table-selectedrow',\n",
       "                hoverTableRow: 'gchart-table-hoverrow',\n",
       "                tableCell: 'gchart-table-cell',\n",
       "                headerCell: 'gchart-table-headercell',\n",
       "                rowNumberCell: 'gchart-table-rownumcell'\n",
       "              }\n",
       "            },\n",
       "            {source_index: 0, fields: 'url,title,score'},\n",
       "            0,\n",
       "            10);\n",
       "        }\n",
       "      );\n",
       "    </script>\n",
       "  "
      ],
      "text/plain": [
       "QueryResultsTable job_xEjVjj18rbx51BN1gSFipUuGdSE"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%bq query\n",
    "SELECT\n",
    "  url, title, score\n",
    "FROM\n",
    "  `bigquery-public-data.hacker_news.stories`\n",
    "WHERE\n",
    "  LENGTH(title) > 10\n",
    "  AND score > 10\n",
    "LIMIT 10"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's do some regular expression parsing in BigQuery to get the source of the newspaper article from the URL. For example, if the url is http://mobile.nytimes.com/...., I want to be left with <i>nytimes</i>. To ensure that the parsing works for all URLs of interest, I'll group by the source to make sure there are no weird names left. This was an iterative process."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "query=\"\"\"\n",
    "SELECT\n",
    "  ARRAY_REVERSE(SPLIT(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.'))[OFFSET(1)] AS source,\n",
    "  COUNT(title) AS num_articles\n",
    "FROM\n",
    "  `bigquery-public-data.hacker_news.stories`\n",
    "WHERE\n",
    "  REGEXP_CONTAINS(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.com$')\n",
    "  AND LENGTH(title) > 10\n",
    "GROUP BY\n",
    "  source\n",
    "ORDER BY num_articles DESC\n",
    "LIMIT 10\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>source</th>\n",
       "      <th>num_articles</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>blogspot</td>\n",
       "      <td>41386</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>github</td>\n",
       "      <td>36525</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>techcrunch</td>\n",
       "      <td>30891</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>youtube</td>\n",
       "      <td>30848</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>nytimes</td>\n",
       "      <td>28787</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>medium</td>\n",
       "      <td>18422</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>google</td>\n",
       "      <td>18235</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>wordpress</td>\n",
       "      <td>17667</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>arstechnica</td>\n",
       "      <td>13749</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>wired</td>\n",
       "      <td>12841</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        source  num_articles\n",
       "0     blogspot         41386\n",
       "1       github         36525\n",
       "2   techcrunch         30891\n",
       "3      youtube         30848\n",
       "4      nytimes         28787\n",
       "5       medium         18422\n",
       "6       google         18235\n",
       "7    wordpress         17667\n",
       "8  arstechnica         13749\n",
       "9        wired         12841"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import google.datalab.bigquery as bq\n",
    "df = bq.Query(query).execute().result().to_dataframe()\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we have good parsing of the URL to get the source, let's put together a dataset of source and titles. This will be our labeled dataset for machine learning."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 92,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>source</th>\n",
       "      <th>title</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>github</td>\n",
       "      <td>Opinionated Dress Color Simulator</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>techcrunch</td>\n",
       "      <td>Meteor Raises $20M</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>techcrunch</td>\n",
       "      <td>DataSift Raises $15M To Help Businesses Mine A...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>github</td>\n",
       "      <td>Writing Cross-Platform Games in Rust Using Piston</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>github</td>\n",
       "      <td>DAws   Advanced Web Shell</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       source                                              title\n",
       "0      github                  Opinionated Dress Color Simulator\n",
       "1  techcrunch                                 Meteor Raises $20M\n",
       "2  techcrunch  DataSift Raises $15M To Help Businesses Mine A...\n",
       "3      github  Writing Cross-Platform Games in Rust Using Piston\n",
       "4      github                          DAws   Advanced Web Shell"
      ]
     },
     "execution_count": 92,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "query=\"\"\"\n",
    "SELECT source, REGEXP_REPLACE(title, '[^a-zA-Z0-9 $.-]', ' ') AS title FROM\n",
    "(SELECT\n",
    "  ARRAY_REVERSE(SPLIT(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.'))[OFFSET(1)] AS source,\n",
    "  title\n",
    "FROM\n",
    "  `bigquery-public-data.hacker_news.stories`\n",
    "WHERE\n",
    "  REGEXP_CONTAINS(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.com$')\n",
    "  AND LENGTH(title) > 10\n",
    ")\n",
    "WHERE (source = 'github' OR source = 'nytimes' OR source = 'techcrunch')\n",
    "\"\"\"\n",
    "df = bq.Query(query + \" LIMIT 10\").execute().result().to_dataframe()\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For ML training, we will need to split our dataset into training and evaluation datasets (and perhaps an independent test dataset if we are going to do model or feature selection based on the evaluation dataset).  A simple way to do this is to use the hash of a well-distributed column in our data (See https://www.oreilly.com/learning/repeatable-sampling-of-data-sets-in-bigquery-for-machine-learning).\n",
    "<p>\n",
    "So, let's do that and save the results as CSV files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 93,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>source</th>\n",
       "      <th>title</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>github</td>\n",
       "      <td>Awesome per directory history for ZSH</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>github</td>\n",
       "      <td>PHP class which implements the Elo rating system</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>github</td>\n",
       "      <td>Comic Sans Everything</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>github</td>\n",
       "      <td>Amazing Flat version of Twitter Bootstrap</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>github</td>\n",
       "      <td>A year of fun and hard work on learning Dart</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   source                                             title\n",
       "0  github             Awesome per directory history for ZSH\n",
       "1  github  PHP class which implements the Elo rating system\n",
       "2  github                             Comic Sans Everything\n",
       "3  github         Amazing Flat version of Twitter Bootstrap\n",
       "4  github      A year of fun and hard work on learning Dart"
      ]
     },
     "execution_count": 93,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "traindf = bq.Query(query + \" AND ABS(MOD(FARM_FINGERPRINT(title), 4)) > 0\").execute().result().to_dataframe()\n",
    "evaldf  = bq.Query(query + \" AND ABS(MOD(FARM_FINGERPRINT(title), 4)) = 0\").execute().result().to_dataframe()\n",
    "traindf.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 86,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "github        27445\n",
       "techcrunch    23131\n",
       "nytimes       21586\n",
       "Name: source, dtype: int64"
      ]
     },
     "execution_count": 86,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "traindf['source'].value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 87,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "github        9080\n",
       "techcrunch    7760\n",
       "nytimes       7201\n",
       "Name: source, dtype: int64"
      ]
     },
     "execution_count": 87,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "evaldf['source'].value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 94,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "traindf.to_csv('train.csv', header=False, index=False, encoding='utf-8', sep='\\t')\n",
    "evaldf.to_csv('eval.csv', header=False, index=False, encoding='utf-8', sep='\\t')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 95,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "github\tAwesome per directory history for ZSH\n",
      "github\tPHP class which implements the Elo rating system\n",
      "github\tComic Sans Everything\n"
     ]
    }
   ],
   "source": [
    "!head -3 train.csv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 96,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  24041 eval.csv\n",
      "  72162 train.csv\n",
      "  72164 vocab.csv\n",
      " 168367 total\n"
     ]
    }
   ],
   "source": [
    "!wc -l *.csv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 97,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Copying file://eval.csv [Content-Type=text/csv]...\n",
      "Copying file://train.csv [Content-Type=text/csv]...                             \n",
      "Copying file://vocab.csv [Content-Type=text/csv]...                             \n",
      "\\ [3 files][  9.4 MiB/  9.4 MiB]                                                \n",
      "Operation completed over 3 objects/9.4 MiB.                                      \n"
     ]
    }
   ],
   "source": [
    "%bash\n",
    "gcloud storage cp *.csv gs://${BUCKET}/txtcls1/"   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2> TensorFlow code </h2>\n",
    "\n",
    "Please explore the code in this <a href=\"txtcls1/trainer\">directory</a> -- <a href=\"txtcls1/trainer/model.py\">model.py</a> contains the key TensorFlow model and <a href=\"txtcls1/trainer/task.py\">task.py</a> has a main() that launches off the training job.\n",
    "\n",
    "However, the following cells should give you an idea of what the model code does:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1.2.1\n",
      "12 words into vocab.tsv\n",
      "Some title --> [8 4]\n"
     ]
    }
   ],
   "source": [
    "import tensorflow as tf\n",
    "from tensorflow.contrib import lookup\n",
    "from tensorflow.python.platform import gfile\n",
    "\n",
    "print tf.__version__\n",
    "MAX_DOCUMENT_LENGTH = 5  \n",
    "PADWORD = 'ZYXW'\n",
    "\n",
    "# vocabulary\n",
    "lines = ['Some title', 'A longer title', 'An even longer title', 'This is longer than doc length']\n",
    "\n",
    "# create vocabulary\n",
    "vocab_processor = tf.contrib.learn.preprocessing.VocabularyProcessor(MAX_DOCUMENT_LENGTH)\n",
    "vocab_processor.fit(lines)\n",
    "with gfile.Open('vocab.tsv', 'wb') as f:\n",
    "    f.write(\"{}\\n\".format(PADWORD))\n",
    "    for word, index in vocab_processor.vocabulary_._mapping.iteritems():\n",
    "      f.write(\"{}\\n\".format(word))\n",
    "N_WORDS = len(vocab_processor.vocabulary_)\n",
    "print '{} words into vocab.tsv'.format(N_WORDS)\n",
    "\n",
    "# can use the vocabulary to convert words to numbers\n",
    "table = lookup.index_table_from_file(\n",
    "  vocabulary_file='vocab.tsv', num_oov_buckets=1, vocab_size=None, default_value=-1)\n",
    "numbers = table.lookup(tf.constant(lines[0].split()))\n",
    "with tf.Session() as sess:\n",
    "  tf.tables_initializer().run()\n",
    "  print \"{} --> {}\".format(lines[0], numbers.eval())   "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ZYXW\n",
      "A\n",
      "even\n",
      "longer\n",
      "title\n",
      "This\n",
      "doc\n",
      "is\n",
      "Some\n",
      "An\n",
      "length\n",
      "than\n",
      "<UNK>\n"
     ]
    }
   ],
   "source": [
    "!cat vocab.tsv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "titles= ['Some title' 'A longer title' 'An even longer title'\n",
      " 'This is longer than doc length'] (4,)\n",
      "words= SparseTensorValue(indices=array([[0, 0],\n",
      "       [0, 1],\n",
      "       [1, 0],\n",
      "       [1, 1],\n",
      "       [1, 2],\n",
      "       [2, 0],\n",
      "       [2, 1],\n",
      "       [2, 2],\n",
      "       [2, 3],\n",
      "       [3, 0],\n",
      "       [3, 1],\n",
      "       [3, 2],\n",
      "       [3, 3],\n",
      "       [3, 4],\n",
      "       [3, 5]]), values=array(['Some', 'title', 'A', 'longer', 'title', 'An', 'even', 'longer',\n",
      "       'title', 'This', 'is', 'longer', 'than', 'doc', 'length'], dtype=object), dense_shape=array([4, 6]))\n",
      "dense= [['Some' 'title' 'ZYXW' 'ZYXW' 'ZYXW' 'ZYXW']\n",
      " ['A' 'longer' 'title' 'ZYXW' 'ZYXW' 'ZYXW']\n",
      " ['An' 'even' 'longer' 'title' 'ZYXW' 'ZYXW']\n",
      " ['This' 'is' 'longer' 'than' 'doc' 'length']] (?, ?)\n",
      "numbers= [[ 8  4  0  0  0  0]\n",
      " [ 1  3  4  0  0  0]\n",
      " [ 9  2  3  4  0  0]\n",
      " [ 5  7  3 11  6 10]] (?, ?)\n",
      "padding= [[0 0]\n",
      " [0 5]] (2, 2)\n",
      "padded= [[ 8  4  0  0  0  0  0  0  0  0  0]\n",
      " [ 1  3  4  0  0  0  0  0  0  0  0]\n",
      " [ 9  2  3  4  0  0  0  0  0  0  0]\n",
      " [ 5  7  3 11  6 10  0  0  0  0  0]] (?, ?)\n",
      "sliced= [[ 8  4  0  0  0]\n",
      " [ 1  3  4  0  0]\n",
      " [ 9  2  3  4  0]\n",
      " [ 5  7  3 11  6]] (?, 5)\n"
     ]
    }
   ],
   "source": [
    "# string operations\n",
    "titles = tf.constant(lines)\n",
    "words = tf.string_split(titles)\n",
    "densewords = tf.sparse_tensor_to_dense(words, default_value=PADWORD)\n",
    "numbers = table.lookup(densewords)\n",
    "\n",
    "# now pad out with zeros and then slice to constant length\n",
    "padding = tf.constant([[0,0],[0,MAX_DOCUMENT_LENGTH]])\n",
    "padded = tf.pad(numbers, padding)\n",
    "sliced = tf.slice(padded, [0,0], [-1, MAX_DOCUMENT_LENGTH])\n",
    "\n",
    "with tf.Session() as sess:\n",
    "  tf.tables_initializer().run()\n",
    "  print \"titles=\", titles.eval(), titles.shape\n",
    "  print \"words=\", words.eval()\n",
    "  print \"dense=\", densewords.eval(), densewords.shape\n",
    "  print \"numbers=\", numbers.eval(), numbers.shape\n",
    "  print \"padding=\", padding.eval(), padding.shape\n",
    "  print \"padded=\", padded.eval(), padded.shape\n",
    "  print \"sliced=\", sliced.eval(), sliced.shape \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "def init(bucket, num_steps):\n",
      "def save_vocab(trainfile, txtcolname, outfilename):\n",
      "def read_dataset(prefix):\n",
      "def cnn_model(features, target, mode):\n",
      "def serving_input_fn():\n",
      "def get_train():\n",
      "def get_valid():\n",
      "def experiment_fn(output_dir):\n"
     ]
    }
   ],
   "source": [
    "%bash\n",
    "grep \"^def\" txtcls1/trainer/model.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's make sure the code works locally on a small dataset for a few steps."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "%bash\n",
    "echo \"bucket=${BUCKET}\"\n",
    "rm -rf outputdir\n",
    "export PYTHONPATH=${PYTHONPATH}:${PWD}/txtcls1\n",
    "python -m trainer.task \\\n",
    "   --bucket=${BUCKET} \\\n",
    "   --output_dir=outputdir \\\n",
    "   --job-dir=./tmp --train_steps=200"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When I ran it, I got a 41% accuracy after a few steps. Because batchsize=32, 200 steps is essentially 6400 examples -- the full dataset is 72,000 examples, so this is not even the full dataset. And already, we are doing better than random chance.\n",
    "<p>\n",
    "Once the code works in standalone mode, you can run it on Cloud ML Engine. You can monitor the job from the GCP console in the Cloud Machine Learning Engine section.  Since we have 72,000 examples and batchsize=32, train_steps=36,000 essentially means 16 epochs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "%bash\n",
    "OUTDIR=gs://${BUCKET}/txtcls1/trained_model\n",
    "JOBNAME=txtcls_$(date -u +%y%m%d_%H%M%S)\n",
    "echo $OUTDIR $REGION $JOBNAME\n",
    "gcloud storage rm --recursive --continue-on-error $OUTDIR\n",    "gcloud storage cp txtcls1/trainer/*.py $OUTDIR\n",    "gcloud ml-engine jobs submit training $JOBNAME \\\n",
    "   --region=$REGION \\\n",
    "   --module-name=trainer.task \\\n",
    "   --package-path=$(pwd)/txtcls1/trainer \\\n",
    "   --job-dir=$OUTDIR \\\n",
    "   --staging-bucket=gs://$BUCKET \\\n",
    "   --scale-tier=BASIC --runtime-version=1.2 \\\n",
    "   -- \\\n",
    "   --bucket=${BUCKET} \\\n",
    "   --output_dir=${OUTDIR} \\\n",
    "   --train_steps=36000"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Training finished with an accuracy of 73%.  Obviously, this was trained on a really small dataset and with more data will hopefully come even greater accuracy."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2> Deploy trained model </h2>\n",
    "<p>\n",
    "Deploying the trained model to act as a REST web service is a simple gcloud call."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "gs://cloud-training-demos-ml/txtcls1/trained_model/export/Servo/\n",
      "gs://cloud-training-demos-ml/txtcls1/trained_model/export/Servo/1499296055/\n"
     ]
    }
   ],
   "source": [
    "%bash\n",
    "gcloud storage ls gs://${BUCKET}/txtcls1/trained_model/export/Servo/"   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "%bash\n",
    "MODEL_NAME=\"txtcls\"\n",
    "MODEL_VERSION=\"v1\"\n",
    "MODEL_LOCATION=$(gcloud storage ls gs://${BUCKET}/txtcls1/trained_model/export/Servo/ | tail -1)\n",    "echo \"Deleting and deploying $MODEL_NAME $MODEL_VERSION from $MODEL_LOCATION ... this will take a few minutes\"\n",
    "#gcloud ml-engine versions delete ${MODEL_VERSION} --model ${MODEL_NAME}\n",
    "#gcloud ml-engine models delete ${MODEL_NAME}\n",
    "gcloud ml-engine models create ${MODEL_NAME} --regions $REGION\n",
    "gcloud ml-engine versions create ${MODEL_VERSION} --model ${MODEL_NAME} --origin ${MODEL_LOCATION}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2> Use model to predict </h2>\n",
    "<p>\n",
    "Send a JSON request to the endpoint of the service to make it predict which publication the article is more likely to run in. These are actual titles of articles in the New York Times, github, and TechCrunch on June 19.   These titles were not part of the training or evaluation datasets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "response={u'predictions': [{u'source': u'nytimes', u'prob': [0.7775614857673645, 5.86951500736177e-05, 0.22237983345985413], u'class': 0}, {u'source': u'github', u'prob': [0.1087314561009407, 0.8909648656845093, 0.0003036781563423574], u'class': 1}, {u'source': u'techcrunch', u'prob': [0.0021869686897844076, 1.563105769264439e-07, 0.9978128671646118], u'class': 2}]}\n"
     ]
    }
   ],
   "source": [
    "from googleapiclient import discovery\n",
    "from oauth2client.client import GoogleCredentials\n",
    "import json\n",
    "\n",
    "credentials = GoogleCredentials.get_application_default()\n",
    "api = discovery.build('ml', 'v1beta1', credentials=credentials,\n",
    "            discoveryServiceUrl='https://storage.googleapis.com/cloud-ml/discovery/ml_v1beta1_discovery.json')\n",
    "\n",
    "request_data = {'instances':\n",
    "  [\n",
    "      {\n",
    "        'title': 'Supreme Court to Hear Major Case on Partisan Districts'\n",
    "      },\n",
    "      {\n",
    "        'title': 'Furan -- build and push Docker images from GitHub to target'\n",
    "      },\n",
    "      {\n",
    "        'title': 'Time Warner will spend $100M on Snapchat original shows and ads'\n",
    "      },\n",
    "  ]\n",
    "}\n",
    "\n",
    "parent = 'projects/%s/models/%s/versions/%s' % (PROJECT, 'txtcls', 'v1')\n",
    "response = api.projects().predict(body=request_data, name=parent).execute()\n",
    "print \"response={0}\".format(response)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you can see, the trained model predicts that the Supreme Court article is 78% likely to come from New York Times and 22% from TechCrunch. The Docker article is 89% likely to be from GitHub according to the service and the Time Warner one is 100% likely to be from TechCrunch."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Copyright 2017 Google Inc. Licensed under the Apache License, Version 2.0 (the \"License\"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
